{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "001361c5",
   "metadata": {},
   "source": [
    "# Decision Trees\n",
    "A decision tree is a simple, efficient, and highly interpretable model: each decision or event can lead to two or more follow-on events, each with a different outcome.\n",
    "Mapping out the probable paths along which events may unfold yields a branching diagram, hence the name decision tree.\n",
    "\n",
    "At its core, a decision tree is a series of if/else statements that classifies a sample by repeatedly asking questions about its features."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e8bacbc",
   "metadata": {},
   "source": [
    "## 1. Information Entropy\n",
    "**Information entropy is a concept from information theory for quantifying the amount of information.** The more ordered a system is, the lower its entropy; the more disordered a system is, the higher its entropy.\n",
    "\n",
    "Entropy can therefore also be read as a measure of how ordered a system is.\n",
    "\n",
    "For example, to pin down something highly uncertain, or something we know nothing about, we need a large amount of information.\n",
    "Conversely, if we already know a lot about something, little extra information is needed to settle it.\n",
    "Information entropy is also called Shannon entropy.\n",
    "\n",
    "**It measures the disorder among the elements of a collection: the more mixed the elements, the higher the disorder.**\n",
    "\n",
    "<img src='img/信息熵.jpg' width='400px' />\n",
    "\n",
    "Example:\n",
    "> If the probability that Xiao Ming goes out to play tomorrow is 0.6 and the probability that he stays in is 0.4, the entropy is about 0.9709506.\n",
    "\n",
    "<img src='img/信息熵计算.jpg' width='400px' />\n",
    "\n",
    "\n",
    "(1) The minus sign guarantees that the information measure is always positive or zero.\n",
    "(2) Base 2: we only need low-probability events x to map to high information content, so the base of the logarithm is arbitrary; base 2 is the common convention (giving units of bits).\n",
    "\n",
    "**Conditional entropy**: the entropy that remains once some condition holds, i.e. once the value of some variable is fixed.\n",
    "\n",
    "\n",
    "\n",
    "***\n",
    "<br>\n",
    "\n",
    "## 2. Information Gain\n",
    "To measure the change in entropy, we introduce information gain:\n",
    "\n",
    "**information gain = entropy − conditional entropy**\n",
    "\n",
    "Information gain is the drop in the data's uncertainty after conditioning on feature A; it measures how much a feature (variable) reduces the uncertainty of the class (target variable).\n",
    "\n",
    "**The larger the information gain, the better the feature separates the classes, so we prefer that feature.**\n",
    "\n",
    "[Illustration 1: information entropy]\n",
    "\n",
    "Say the entropy of \"rain tomorrow\" is 2 and the conditional entropy is 0.01 (the condition: tomorrow is overcast, and rain is very likely on overcast days, so little uncertainty remains); then 2 − 0.01 = 1.99.\n",
    "After learning that it will be overcast, the uncertainty about rain drops by 1.99, a large reduction, so the information gain is large.\n",
    "\n",
    "In other words, knowing it is overcast is very informative about rain.\n",
    "\n",
    "[Illustration 2: conditional entropy]\n",
    "\n",
    "Conditional entropy is the complexity or uncertainty that remains in a random variable once some condition is known:\n",
    "uncertainty (entropy) about buying a piece of clothing: 2.6\n",
    "after reading the reviews: 1.2 → Gain1 = 2.6 − 1.2 = 1.4\n",
    "√ after trying it on in a store: 0.9 → Gain2 = 2.6 − 0.9 = 1.7\n",
    "\n",
    "Clearly, trying it on reduces the buying uncertainty by more.\n",
    "\n",
    "**We usually adopt information gain as the criterion for selecting the split feature at a node.** The ID3 decision-tree algorithm uses exactly this information gain (Info-Gain).\n",
    "\n",
    "***\n",
    "\n",
    "<br>\n",
    "\n",
    "## 3. Information Gain Ratio\n",
    "\n",
    "Information gain is biased toward features with many possible values, and the gain ratio was introduced to curb the downside of that bias.\n",
    "\n",
    "The problem with raw information gain is that a feature with many distinct values tends to get a large gain.\n",
    "\n",
    "But a feature with many values also carries a lot of information itself; its own entropy is large.\n",
    "\n",
    "Dividing the information gain by the feature's own entropy therefore yields the information gain rate, also called the **gain ratio**. The C4.5 decision-tree algorithm uses the gain ratio.\n"
   ]
  },
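  {
   "cell_type": "markdown",
   "id": "entropy-demo-md",
   "metadata": {},
   "source": [
    "A quick numerical check of the three criteria above, in plain Python. The weather and feature numbers are the illustrative ones from the text, not real data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "entropy-demo-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def entropy(probs):\n",
    "    # H = -sum(p * log2(p)), skipping zero-probability terms\n",
    "    return -sum(p * math.log2(p) for p in probs if p > 0)\n",
    "\n",
    "# Xiao Ming: P(go out) = 0.6, P(stay in) = 0.4\n",
    "print(entropy([0.6, 0.4]))         # ~0.9709506\n",
    "\n",
    "# information gain = entropy - conditional entropy\n",
    "# (rain example: entropy 2, conditional entropy 0.01)\n",
    "gain = 2 - 0.01\n",
    "print(gain)                        # 1.99\n",
    "\n",
    "# gain ratio = information gain / the feature's own entropy,\n",
    "# e.g. a uniformly distributed 4-value feature has entropy 2 bits\n",
    "print(gain / entropy([0.25] * 4))  # 0.995\n"
   ]
  },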
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "42e82c13",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXIAAAD4CAYAAADxeG0DAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAdoklEQVR4nO3deXydVb3v8c9v7ww7w848NmMnOtCBQoCWltZSLqIgk4jIYRARDsqgV196UK73eK/H61FxukdferyAE3BUBgUVwTIrQ0taKB3SUlo6pG2atGnGZt7r/rF3YiekNDt58mR/369XX83z7GQ9v/1q883KetazljnnEBER/wp4XYCIiAyPglxExOcU5CIiPqcgFxHxOQW5iIjPJXlx0YKCAlddXe3FpUVEfGvVqlX7nHOFR573JMirq6upra314tIiIr5lZtuPdV5DKyIiPqcgFxHxOQW5iIjPKchFRHxOQS4i4nMKchERn1OQi4j4nIJcRMTnfBXk9/ztba7/2UqvyxARGVN8FeQ7mw9Su/2A12WIiIwpvgrypIAxENGORiIih/JVkAeDCnIRkSP5KsjVIxcROZqvgrw0O42TJ2ShDaNFRP7Ok2VsT9TV86u4en6V12WIiIwpvuqRi4jI0XwV5A+vqucDP/grXb0DXpciIjJm+CrIDxzspW5PG32RiNeliIiMGXEJcjPLMbOHzGyjmdWZ2YJ4tHukpIABMDCgm50iIoPidbPzB8ATzrnLzSwFSI9Tu4cJBqM/d/o1BVFEZMiwg9zMsoDFwMcBnHO9QO9w2z2WoR65glxEZEg8hlYmAU3Az8zsNTO728wyjvwkM7vJzGrNrLapqemELlSSFeKsyfkEY4EuIiJgw324xsxqgFeAhc65FWb2A6DNOfeVd/qampoaV1tbO6zriogkGjNb5ZyrOfJ8PHrk9UC9c25F7Pgh4NQ4tCsiIsdh2EHunGsAdprZtNipZcCG4bZ7LM9tamTxt55la1PHSDQvIuJL8Zq1chtwf2zGylbg+ji1e5juvgg7mg/S3ad55CIig+IS5M6514Gjxm3iLahZKyIiR/HVk51D0w+1+qGIyBBfBXly7IGg3n4NrYiIDPJVkBeGUzl3RjHhkK9W3xURGVG+SsRpJWHuvm7Eh+JFRHzFVz1yERE5mq+CfGfzQWr+bTl/WLPb61JERMYMXwV5MGDs6+ils6ff61JERMYMXwV5alJs1sqAZq2IiAzyV5AnBwHo0ZOdIiJDfBXkKbF55D392rNTRGSQr4I8OWhcfMoEphSFvS5FRGTM8NU8cjPjB1fO87oMEZExxVc9chEROZrvgnzJt5/ly79b63UZIiJjhu+CPOIc3b262SkiMsh3QZ6aFKRHqx+KiAzxYZAHNP1QROQQPg1y9chFRAb5avohwAdnlw5t+SYiIj4M8k+ePcnrEkRExhTfDa045+ju0xi5iMgg3wX55367hvO+94LXZYiIjBm+C/L0lCAdWo9cRGSI74I8M5SkIBcROYT/gjwlid7+CL2agigiAvgxyEPRiTba7k1EJMp3QX5KRQ63Lp1CUlBzyUVEwIfzyOdV5jKvMtfrMkRExgzf9cgHIo79HT2aSy4iEuO7IF9T38Jp//YUL2/d73UpIiJjgu+CPDNVNztFRA4VtyA3s6CZvWZmf4xXm8cyGOQd3QpyERGIb4/8M0BdHNs7psHph3ooSEQkKi5BbmblwAXA3fFo7x8JpyYRDBgHDvaO9KVERHwhXtMPvw98EQi/0yeY2U3ATQCVlZUnfCEz447zpzO7PPuE2xARGU+G3SM3swuBRufcqn/0ec65nzrnapxzNYWFhcO65o2LJzF/Uv6w2hARGS/iMbSyELjIzLYBvwbOMbP74tDuO2pq72H7/s6RvISIiG8MO8idc19yzpU756qBK4FnnHNXD7uyf+BLj7zBzfetHslLiIj4hu/mkQPkpKfQopudIiJAnNdacc49
BzwXzzaPJTc9WbNWRERifNsj7+6L0NWr9VZERHwZ5IXhVAD2dfR4XImIiPd8GeRnTszjW5fPISst2etSREQ857v1yAGq8jOoys/wugwRkTHBlz3ySMSxZmcLO/Yf9LoUERHP+TLIzeDyn7zEAyt3eF2KiIjnfBrkRlE4RGNbt9eliIh4zpdBDlCUlcredgW5iIhvg7w4HGJvm6Yfioj4N8izUtmroRUREX9OPwS48oxKzju5BOccZuZ1OSIinvFtkM8ozfK6BBGRMcG3QyvdfQM8sa6BrU0dXpciIuIp3wZ570CEm+9bxVN1e70uRUTEU74N8qxQMjnpyexo1tOdIpLYfBvkAJV56exo7vK6DBERT/k6yCty09mhvTtFJMH5OsgnF2awo/kg3X3aYEJEEpdvpx8CXHVmFZeeWk5K0Nc/j0REhsXXQV6SHfK6BBERz/m+K3v/iu08ub7B6zJERDzj+yD/xUvbeLB2p9dliIh4xvdBPrU4zKa97V6XISLiGd8H+bTiMDubuzjY2+91KSIinvB9kM+MLZ61YXebx5WIiHjD90E+pyIbM9japAeDRCQx+Xr6IUBROMQb/3oe4VCy16WIiHjC9z1yQCEuIgltXAT5qu3NXHfvSvZ1aA9PEUk84yLI+wccz7/ZxOs7WrwuRURk1A07yM2swsyeNbM6M1tvZp+JR2HvxdyKHFKSAryydf9oX1pExHPxuNnZD3zeObfazMLAKjNb7pzbEIe2j0soOciplTm8rCAXkQQ07B65c26Pc2517ON2oA4oG26779WCSQVs2NNGy8He0b60iIin4jpGbmbVwDxgxTFeu8nMas2stqmpKZ6XBWDR1HxOr85jf6eCXEQSiznn4tOQWSbwPPB159wj/+hza2pqXG1tbVyuKyKSKMxslXOu5sjzcemRm1ky8DBw/7uF+Ehr7eojXj+cRET8IB6zVgy4B6hzzn13+CWduGc27uW0ry2nbo9WQxSRxBGPHvlC4BrgHDN7Pfbng3Fo9z2bXZbDgHM8VbfXi8uLiHhi2NMPnXN/AywOtQxbYTiVUypyeLpuL7cvm+p1OSIio2JcPNl5qHNnFLOmvpXdLV1elyIiMirGXZBfOKcUgMfW7Pa4EhGR0eH7ZWyPVJWfwTc/PJuFUwq8LkVEZFSMuyAH+OjplV6XICIyasbd0MqgJ9bt4eFV9V6XISIy4sZtkD+yehdff7yOnv4Br0sRERlR4zbIr1lQRXNnL39e2+B1KSIiI2rcBvnCyQVMLMjgV69s97oUEZERNW6DPBAwrp5fxartB1i/u9XrckRERsy4DXKAy08tZ1pxmAOdfV6XIiIyYsbl9MNB2enJPPHZs4mu6yUiMj6N6x45gJnR3Teg/TxFZNwa90EO8N3lb3LtPStpaO32uhQRkbhLiCC/Zn4VAN9b/qbHlYiIxF9CBHlFXjrXLKjiwVU72dSgTSdEZHxJiCAHuHXpFDJSk/jmExu9LkVEJK4SJshzM1K4dekUmjt76ezp97ocEZG4GdfTD490w6KJ3LR4kqYjisi4kjA9coCkYAAzY19HD0+u1xosIjI+JFSQD/r2E5u47YHX2NLU4XUpIiLDlpBB/vn3n0QoOcAXHlxD/0DE63JERIYlIYO8KBzia5fMYvWOFn783BavyxERGZaEDHKAi08p4+JTJvD9pzezZmeL1+WIiJywhJq1cqT/ffEs0lOSKM9N87oUEZETltBBnp2WzDcumw1Ab3+EpIARCGhqooj4S8IOrRyqo6efK/7zZX78vMbLRcR/FORARkqQyrx07vrLJp7d2Oh1OSIi74mCnOia5f/+4dnMLM3ilgdWs26XtoYTEf9QkMekpyTxs4+fTm56Ctf//FV2Nh/0uiQRkeOiID9EUVaIX3zidCZkhxiIOK/LERE5Lgk9a+VYphSF+f0tCzEzIhFHZ28/4VCy12WJiLyjuPTIzex8M9tkZm+Z2R3xaNNLg6sjfuXRdVz1/1bQ2tXn
cUUiIu9s2EFuZkHgR8AHgJnAx8xs5nDbHQvOnVHMxoY2rr13pcJcRMasePTIzwDecs5tdc71Ar8GLo5Du55bOr2IH111Kht2t/LR/3yZxjZt3iwiY088grwM2HnIcX3s3GHM7CYzqzWz2qampjhcdnScd3IJ9378dHY0H+Tae1fqJqiIjDnxuNl5rGfaj0o759xPgZ8C1NTU+CoNz55ayP2fPJP27n6CeoRfRMaYeAR5PVBxyHE5sDsO7Y4p8ypzhz7+1Svbyc9I4YOzSz2sSEQkKh5DK68CU81sopmlAFcCj8Wh3TFpIOL4w+u7+fT9q/m/T2/GOV/9ciEi49Cwg9w51w/cCjwJ1AG/dc6tH267Y1UwYPzyhjO4bF4Z313+Jrc8sJq2bs1oERHvxOWBIOfc48Dj8WjLD0LJQb5zxVyml4b55hOb2NjwIo/ffjah5KDXpYlIAtKTnSfIzLhp8WTmVeayYXebQlxEPKO1Vobp9Oo8rjurGoDn32ziU/etYn9Hj7dFiUhCUZDH0a4DXTxd18j7v/8Cyzfs9bocEUkQCvI4uurMSh67bSGF4RA3/rKWLzy4Ro/2i8iIU5DH2fSSLH5/y1l8+n2TeXh1vXrmIjLidLNzBKQmBfni+dO5ZF4ZUwozAXh2YyNTijKpyEv3uDoRGW8U5CPopOIwAL39Eb70yFpaunq5dekUPnn2JM1yEZG40dDKKEhJCvDIp8/ifScVcddf3mTZd57nj2/s1lOhIhIXCvJRMiEnjZ9ccxoP3HgmWWnJ3PrAa6zVJs8iEgcaWhllZ00u4I+3LeLFt/YxpzwHgF+v3EFNdS5TisLeFicivqQg90AwYCw+qRCAjp5+vvHnjbR393HpvHI+e+5U3RAVkfdEQysey0xN4pnPL+GGRRP54xu7WXrXc9z5u7U0tms3IhE5PgryMSA/M5U7L5jJ819YypVnVPDI6l30DURvhPYNRDyuTkTGOvNi5kRNTY2rra0d9ev6RWtXH9lpyQBcd+9K0pKDfHrp5KExdRFJTGa2yjlXc+R59cjHoMEQj0Qcc8qzeXHLPi764Yv8092v8NSGvUS0b6iIHEJBPoYFAsbnz5vGS3ecwx0fmM6Wxk4++cta7lux3evSRGQM0awVHwiHkrl5yWRuWDSRJ9c3sGhKAQB/emMPr25r5toFVUyKLQUgIolHQe4jycEAF86ZMHS8pamD+1ds5+cvbWP+pDw+dkYl7z+5RI//iyQY3ez0ucb2bh6srefXr+5gZ3MXZ03O54Eb53tdloiMgHe62akeuc8VhUPcsnQKn1oymZe27McR/cHc3t3Hzfet4oLZE7hgdinZ6ckeVyoiI0VBPk4EAsaiqQVDxzubu2ho7ebLv1vLVx9bzznTi7j01DKWTisiJUn3uEXGE31Hj1MzJ2Tx1OeW8IdbF/FP8yup3d7MP/9qFTuaOwFo6+7TNEaRcUI98nHMzJhdns3s8mzu/OAMarcfGFqY687fraN2WzMfmFXKBXNKmFeRSyBgHlcsIidCQZ4gkoIB5k/KHzp+/8nFdPX2c98r27n3xbcpyQpx3VnVfOp9kz2sUkROhII8QV04ZwIXzplAe3cfT9c18qe1e+jqGwCi67t8888bWTKtkDMn5mtMXWSM0/RDOcr63a18+Mcv0d0XITM1iSUnFbJsRhHLZhQPLR8gIqNPa63IcTt5QjavfeU87rmuhg/NLWXltmY+99s1bGpoB2Bn80G2NnV4XKWIDNLQihxTWkqQZTOKWTajmK9HHG/samXWhCwAfvbiNu598W0q89JZfFIBi6cWctaUAjJT9d9JxAv6zpN3FQgYp1TkDB3fcPZEqgvSeX5TE4+s3sV9r+ygMJzKyi8vw8xoaO2mKJyqWTAio0RBLu9ZWU4a1y6o5toF1fT2R1i1/QCN7d2YRYP7oz99mY7ufhZNLWDhlAIWTMrX9nUiI0hBLsOSkhRgweS/T2uMRByfPXcqL7y5j79ubuLR13cD
cOPZE7nzgpk452ho66Y0O82rkkXGnWEFuZl9G/gQ0AtsAa53zrXEoS7xqUDAuHReOZfOK8c5x+bGDl7esp9pJdEHkd5q7OC/fe8FqvLTWTApnwWT85k/KZ/irJDHlYv417CmH5rZecAzzrl+M/smgHPuX97t6zT9MHHt6+jh0dd38/KW/ax4ez/t3f0A/PITZ7D4pEIa27pp6+5jcmHm0FCNiESNyOqHzrm/HHL4CnD5cNqT8a8gM5UbFk3khkUTGYg46va08fKW/cwpzwbgodX1fOuJTeSmJ3NaVS6nVeVxenUup1TkkBTUbFmRY4nnGPkngN+804tmdhNwE0BlZWUcLyt+FQwYs8qymVWWPXTuorkTKMhIpXZ7M7XbDvBUXSMpwQBvfPU8koLw7MZGegcinFKRo+EYkZh3HVoxs6eAkmO8dKdz7tHY59wJ1ACXueMYq9HQihyv/R09bG7sGFon5oqfvMzKbc0AlGSFmFuRzaKphVwzv8rLMkVGxQkPrTjnzn2Xhq8DLgSWHU+Ii7wX+Zmp5GemDh3/8oYzWL+7jTU7W1hT38KanS0MRNxQkN/w81fJy0hhbkUOp1TkMK0kTLKGZGScG+6slfOBfwGWOOcOxqckkXcWSg7Gxs5zh871DUQA6B+I4ICnNzby4Kp6AFKTAty+bCq3LJ1CJOLYsKeNqcWZpCZpX1MZP4Y7Rv5DIBVYHpth8Ipz7uZhVyXyHgz2uJOCAe79+Ok456g/0MXrO1t4fWcLJxVHpz5u29/Jhf/xN5KDxtSiMLPKsphVls0504soz9UDS+Jfw521MiVehYjEi5lRkZdORV46H5o7Yeh8QTiVH111Kut2t7JuVytP1TXy29p6isIhynPTeaO+hXv/9jazyrKZUZrFtJIwBYcM64iMVXqyUxJGViiZC+aUcsGcUoChp0yzQtGleRtau3llazO/jz2NCtHpkr/55/lMLsxkx/6DtHX3MaUok1CyhmZk7FCQS8Iys8OWCjjv5BLOO7mEfR09bGpop25PGxsb2inNjk5zfGDlDn7y/BaCAaM6P53ppVnMKAlz0+LJ2nxDPKUgFzlCQWYqBVNSWTil4LDzV8+vZHZZNpsa2qhraOeN+hZeemsftyyNjjB+5ffrWLurlalFmUwpymRqcSZTi8JaMExGnIJc5DiV56ZTnps+NDQD0N03MLSUQFluGm81dvDcm01Ds2aml4R54rOLAfjhM5sxs2jIF2VSmZeup1UlLhTkIsNw6Fj5zUsmc/OS6ObVLQd7eauxg57+yNDrT67fy9pdrUPHKcEAH6kp5+uXzgbgqQ17Kc0JMbEgg/QUfWvK8dP/FpERkJOeQk113mHn/nDbIjp6+tnS2MHmxg42N7YzuSATgJ7+AW78VS2Dj9SVZkcD/YqaCi6ZV8ZAxLGz+SDluWnqxctRFOQioygzNYm5FTnMPWTHJYCkQIDHbz+bt/d1srWpg637Otna1ElHT3R1yN0tXbzvrudIDhqVeelMLMhkUmEGF82dwKyybCIRhxlaMTJBKchFxoBgwJhRmsWM0qxjvp6Vlsy3L5/D1n2dvN3UydZ9HbywuYk55dFFx2q3H+CGn79KZX46VfnpVOZlUJWfzrLpRRRpcbFxT0Eu4gPZacl8pKbisHMDEUckNhaTk57MZaeWsW3/Qer2tLN8w176BhwPf2oBRVkh/vTGHr6zfBNVeelU5WdQmRcN/AWT8zUePw7oX1DEp4IBI0h0KOWk4jD/6+JZQ68NRBy7W7ooDEefTM1JT2ZacZjt+w/y6rYDQ0M2L91xDukpSdy/YjuPvb6b6vwMKvPTqcxLpzw3jTnlOQS1ifaYpyAXGYeCATts/vrCKQVD8+KdczR39rK9+SAlsWGX5ECA/ojj6Y2N7OvoGWpj09fOB4z/eHozq3YcoDw3LTYNM43KvHTmlOeM9luTY1CQiyQYMztqeeArTq/gitOjQzedPf3UH+iisb17aIaMWXSbvtd3ttBy
sA+Izqx5+UvLAPifj65jR/NBKmIhX56bzsSCDGZOOPaYv8SXglxEDpORmsS0kvDQhtkAt54zlVvPmQpAe3cfu1q6hvZbheic+Kb2Hl7b0UJrVzToT6vK5eFPnQXAzb9axcG+ASZkhyjNTmNCTohpJWH16ONEQS4i70k4lMz0kuTDzv2PC2cOfdzW3ceuA130D7hDviaJPa1dbNjdNjR0c+GcUn541akAnHPXc6SnBinNTqMsJxr0g3u2QnTMX2P170xBLiJxlRVKJqv08KD/9kfmDn3c0z9AQ2v30PFAxHHmpHz2tHaxfX8nr2zZT3tPPzctnsRpVXl09Q4w66tPUpIVojQ7xIScNEpzQpw3s5jTqvLoH4iwv7OXgszUhA17BbmIjKrUpCBV+RlDx8GA8Y3LZh/2OW3dfUQi0R59fyTCzUsmsaelm10t0Q1D/ryui+JwiNOq8ti2/yDnfvd5ggGjMDOV4uwQJVmpXL9wIvMn5dPa1cf6Xa2x8yEyUsdf7I2/dyQivje4RjxEh3K+8P7ph70eiTj6Y0Gfl5HC1y6Zxd7Wbhrautnb1s3Wpk46Y1Ms19a3cvU9K/7eXmoSxdkhvnHZbE6vzuPtfZ38bXMTxVkhSmJhn++z3r2CXER8JxAwUmJBm5eRMrT59rHMLs/mgRvPZG9bNw2tPbG/u8lOi/6wqN3WzFceXX/Y1wQDxmO3LuTkCdn8dXMTj69toCicSlFWKkXhEEXhVGaUZo2ZdegV5CIyrmWnJXPW5IJ3fP2yU8tZclIhDbGA39sW7dlPiG06srO5i+UbGtjf2Tu0qBnAyjuXURQOcfdft/LQqnqKskIUZg6GfSpXz68iORigtauPlGCAtJSR21VKQS4iCS0YMIqyQhRlhZhTfvTrV51ZyVVnVtI3EGF/Ry+N7d00tvWQnxGdh1+QmUp5bhqN7T1s3ttOU3t0Vs51C6oB+D9/quM3tTsJpyZx4+JJ3L5satzfg4JcROQ4JAcD0TH07MMXIbtkXhmXzCsbOo5EHK1dfQRiQz8XnTKByvx0mtp7OKk4zEhQkIuIxFEgYORmpAwdH7o8wohdc0RbFxGREacgFxHxOQW5iIjPKchFRHxOQS4i4nMKchERn1OQi4j4nIJcRMTnzB26eMBoXdSsCdh+gl9eAOyLYzl+oPecGPSeE8Nw3nOVc67wyJOeBPlwmFmtc67G6zpGk95zYtB7Tgwj8Z41tCIi4nMKchERn/NjkP/U6wI8oPecGPSeE0Pc37PvxshFRORwfuyRi4jIIRTkIiI+55sgN7PzzWyTmb1lZnd4Xc9IM7MKM3vWzOrMbL2ZfcbrmkaLmQXN7DUz+6PXtYwGM8sxs4fMbGPs33uB1zWNNDP777H/1+vM7L/MLPTuX+U/ZnavmTWa2bpDzuWZ2XIz2xz7O3e41/FFkJtZEPgR8AFgJvAxM5vpbVUjrh/4vHNuBjAfuCUB3vOgzwB1Xhcxin4APOGcmw7MZZy/dzMrA24Hapxzs4AgcKW3VY2YnwPnH3HuDuBp59xU4OnY8bD4IsiBM4C3nHNbnXO9wK+Biz2uaUQ55/Y451bHPm4n+s1d9o+/yv/MrBy4ALjb61pGg5llAYuBewCcc73OuRZPixodSUCamSUB6cBuj+sZEc65F4DmI05fDPwi9vEvgEuGex2/BHkZsPOQ43oSINQGmVk1MA9Y4XEpo+H7wBeBiMd1jJZJQBPws9hw0t1mluF1USPJObcLuAvYAewBWp1zf/G2qlFV7JzbA9EOG1A03Ab9EuR2jHMJMW/SzDKBh4HPOufavK5nJJnZhUCjc26V17WMoiTgVODHzrl5QCdx+FV7LIuNCV8MTAQmABlmdrW3VfmbX4K8Hqg45Liccfqr2KHMLJloiN/vnHvE63pGwULgIjPbRnT47Bwzu8/bkkZcPVDvnBv8beshosE+np0LvO2ca3LO9QGPAGd5XNNo2mtmpQCxvxuH26BfgvxVYKqZTTSzFKI3
Rh7zuKYRZWZGdNy0zjn3Xa/rGQ3OuS8558qdc9VE/42fcc6N656ac64B2Glm02KnlgEbPCxpNOwA5ptZeuz/+TLG+Q3eIzwGXBf7+Drg0eE2mDTcBkaDc67fzG4FniR6h/te59x6j8saaQuBa4C1ZvZ67NyXnXOPe1eSjJDbgPtjnZStwPUe1zOinHMrzOwhYDXR2VmvMU4f1Tez/wLeBxSYWT3wr8C/A781sxuI/lD7yLCvo0f0RUT8zS9DKyIi8g4U5CIiPqcgFxHxOQW5iIjPKchFRHxOQS4i4nMKchERn/v/LYaV1l8ZqJUAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# self-information curve: -log2(x); start at 0.01, not 0, since log(0) is undefined\n",
    "a = np.arange(0.01, 10, 0.01)\n",
    "b = -np.log2(a)\n",
    "plt.plot(a, b, '--')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad9a7023",
   "metadata": {},
   "source": [
    "## 4. Gini Index\n",
    "The ID3 tree selects features by information gain and the C4.5 tree by gain ratio, but both criteria have drawbacks, which led to CART; the decision-tree algorithm used in sklearn is also CART.\n",
    "\n",
    "A **CART decision tree** uses the **Gini index** as its feature-selection criterion.\n",
    "\n",
    "<img src='img/基尼指数.jpg' width='500px' />\n",
    "\n",
    "* Pk: the proportion of each possible value of the attribute; the per-value terms are computed and summed\n",
    "* this formula behaves much like the information-gain criterion\n",
    "* the more evenly the attribute values are distributed, i.e. the fuzzier the information, the larger the Gini index, and vice versa\n",
    "\n",
    "* CART, however, always evaluates the split as a binary partition:\n",
    "    * e.g. a season attribute (spring, summer, autumn, winter)\n",
    "        * to evaluate spring: the binary split (spring, (summer, autumn, winter))\n",
    "    * the remaining steps are similar to ID3\n",
    "\n",
    "* Computed this way, the criterion largely avoids the drawback of the ID3 algorithm.\n"
   ]
  },
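  {
   "cell_type": "markdown",
   "id": "gini-demo-md",
   "metadata": {},
   "source": [
    "A sketch of the Gini computation, assuming the form Gini(p) = 1 - Σ p_k^2 shown in the figure above; the season/label values below are made up purely to make the binary split concrete:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "gini-demo-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def gini(probs):\n",
    "    # Gini(p) = 1 - sum(p_k^2): larger when the values are spread evenly\n",
    "    return 1 - sum(p * p for p in probs)\n",
    "\n",
    "print(gini([0.25, 0.25, 0.25, 0.25]))  # 0.75: uniform, most impure\n",
    "print(round(gini([0.9, 0.1]), 2))      # 0.18: skewed, much purer\n",
    "\n",
    "# CART evaluates a multi-valued feature through binary splits, e.g.\n",
    "# season: (spring) vs (summer, autumn, winter). The split score is the\n",
    "# size-weighted average of the branch Gini values.\n",
    "def split_gini(groups):\n",
    "    n = sum(len(g) for g in groups)\n",
    "    return sum(len(g) / n * gini([g.count(c) / len(g) for c in set(g)])\n",
    "               for g in groups)\n",
    "\n",
    "seasons = ['spring'] * 3 + ['summer'] * 3 + ['autumn'] * 3 + ['winter'] * 3\n",
    "watched = ['yes', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'no', 'no', 'yes', 'no']\n",
    "left = [w for s, w in zip(seasons, watched) if s == 'spring']\n",
    "right = [w for s, w in zip(seasons, watched) if s != 'spring']\n",
    "print(round(split_gini([left, right]), 4))\n"
   ]
  },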
  {
   "cell_type": "markdown",
   "id": "cd956548",
   "metadata": {},
   "source": [
    "***\n",
    "<br>\n",
    "\n",
    "## Overfitting and Pruning\n",
    "<img src='img/过拟合.jpg' width='500px' />\n",
    "\n",
    "1. **Overfitting**: chasing accuracy on the training data so hard that the model generalizes poorly.\n",
    "\n",
    "2. **Pruning**: reducing the height of the tree, precisely to fight overfitting. Two flavors: pre-pruning and post-pruning.\n",
    "    * Pre-pruning: top-down; stop building the tree once it reaches a preset height. (faster)\n",
    "    * Post-pruning: bottom-up; let the tree grow to completion, then work up from the bottom deciding which branches to cut.\n",
    "    \n",
    "***\n",
    "<br>"
   ]
  },
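  {
   "cell_type": "markdown",
   "id": "pruning-demo-md",
   "metadata": {},
   "source": [
    "A minimal sketch of the two pruning styles with sklearn, run on the iris toy dataset (an assumption of this sketch, not part of the case studies below): max_depth acts as pre-pruning, while ccp_alpha triggers cost-complexity post-pruning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "pruning-demo-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "X, y = load_iris(return_X_y=True)\n",
    "\n",
    "# pre-pruning: stop growing once the preset depth is reached\n",
    "pre = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)\n",
    "\n",
    "# post-pruning: grow the tree fully, then cut weak branches\n",
    "# (cost-complexity pruning; a larger ccp_alpha prunes more)\n",
    "full = DecisionTreeClassifier(random_state=1).fit(X, y)\n",
    "post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=1).fit(X, y)\n",
    "\n",
    "print(pre.get_depth(), full.get_depth(), post.get_depth())\n"
   ]
  },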
  {
   "cell_type": "markdown",
   "id": "3597c159",
   "metadata": {},
   "source": [
    "# Decision Tree Case Study: Predicting Which Type of Contact Lenses a Patient Should Wear\n",
    "\n",
    "* The dataset has 5 columns\n",
    "* Features (the first 4 columns):\n",
    "    * age\n",
    "    * prescript (prescription)\n",
    "    * astigmatic (whether astigmatic)\n",
    "    * tearRate (tear production rate)\n",
    "    \n",
    "    \n",
    "* Class label (the last column, three classes):\n",
    "    * hard (hard material)\n",
    "    * soft (soft material)\n",
    "    * no lenses (not suited to contact lenses)\n",
    "\n",
    "**tip**\n",
    "\n",
    "The lens feature values are English words, a format the model cannot consume directly, so we encode the dataset by turning the text into numbers. In short, LabelEncoder assigns integer codes to discrete numbers or text (handle missing values before applying LabelEncoder)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee1a3ff2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.metrics import accuracy_score\n",
    "import pandas as pd\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "3143e96f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>prescript</th>\n",
       "      <th>astigmatic</th>\n",
       "      <th>tearRate</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>young</td>\n",
       "      <td>myope</td>\n",
       "      <td>no</td>\n",
       "      <td>reduced</td>\n",
       "      <td>no lenses</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>young</td>\n",
       "      <td>myope</td>\n",
       "      <td>no</td>\n",
       "      <td>normal</td>\n",
       "      <td>soft</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>young</td>\n",
       "      <td>myope</td>\n",
       "      <td>yes</td>\n",
       "      <td>reduced</td>\n",
       "      <td>no lenses</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>young</td>\n",
       "      <td>myope</td>\n",
       "      <td>yes</td>\n",
       "      <td>normal</td>\n",
       "      <td>hard</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>young</td>\n",
       "      <td>hyper</td>\n",
       "      <td>no</td>\n",
       "      <td>reduced</td>\n",
       "      <td>no lenses</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     age prescript astigmatic tearRate      class\n",
       "0  young     myope         no  reduced  no lenses\n",
       "1  young     myope         no   normal       soft\n",
       "2  young     myope        yes  reduced  no lenses\n",
       "3  young     myope        yes   normal       hard\n",
       "4  young     hyper         no  reduced  no lenses"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# feature labels + class label\n",
    "lensesLabels = ['age','prescript','astigmatic','tearRate','class']\n",
    "feature = ['age','prescript','astigmatic','tearRate']\n",
    "filename = 'data/lenses.txt'\n",
    "lenses = pd.read_table(filename,names=lensesLabels,sep='\\t')\n",
    "# names: the column names to use\n",
    "# sep: field separator; '\\t' splits on tab characters\n",
    "lenses.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "f35906d7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
       "True values:\n",
      "8     no lenses\n",
      "6     no lenses\n",
      "17    no lenses\n",
      "0     no lenses\n",
      "21         soft\n",
      "2     no lenses\n",
      "18    no lenses\n",
      "19         hard\n",
      "Name: class, dtype: object\n",
       "Predictions:\n",
      "['no lenses' 'no lenses' 'soft' 'no lenses' 'soft' 'no lenses' 'no lenses'\n",
      " 'hard']\n",
       "Accuracy: 87.500000%\n"
     ]
    }
   ],
   "source": [
    "def main():\n",
    "    # split off the feature columns and the label column\n",
    "    x_train,y_train = lenses[feature],lenses['class']\n",
    "    # create a LabelEncoder object to turn string categories into integer codes\n",
    "    le = LabelEncoder()\n",
    "    # encode column by column (assign an integer to each string value)\n",
    "    for col in x_train.columns:\n",
    "        x_train[col] = le.fit_transform(x_train[col])\n",
    "    # print(x_train)\n",
    "    x_train = x_train.values\n",
    "    x_train, x_test, y_train, y_test = train_test_split(x_train,y_train,test_size=0.3)\n",
    "    \n",
    "    # build the decision-tree classifier\n",
    "    '''\n",
    "    criterion: the feature-selection criterion ('entropy' = information entropy)\n",
    "    '''\n",
    "    clf = DecisionTreeClassifier(criterion='entropy',random_state=1,max_depth=4)\n",
    "    \n",
    "    # fit the decision tree to the training data\n",
    "    model = clf.fit(x_train,y_train)\n",
    "    \n",
    "    # predict\n",
    "    y_pred = model.predict(x_test)\n",
    "    \n",
    "    # report\n",
    "    print('True values:\\n{}'.format(y_test))\n",
    "    print('Predictions:\\n{}'.format(y_pred))\n",
    "    print('Accuracy: %f%%'%(accuracy_score(y_test,y_pred)*100))\n",
    "    \n",
    "if __name__=='__main__':\n",
    "    main()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fb85dab",
   "metadata": {},
   "source": [
    "# Decision Tree Case Study: Predicting Movie Preferences\n",
    "\n",
    "* Genre, type:\n",
    "    * anime\n",
    "    * science (sci-fi)\n",
    "    * action\n",
    "* Country, country:\n",
    "    * Japan\n",
    "    * America\n",
    "    * China\n",
    "    * France\n",
    "* Box office, gross:\n",
    "    * low\n",
    "    * high\n",
    "* Watched, watch:\n",
    "    * yes\n",
    "    * no"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "de8a476c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "from sklearn.feature_extraction import DictVectorizer\n",
    "from sklearn import preprocessing\n",
    "from sklearn import tree\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "05be782a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['id', 'type', 'country', 'gross', 'watch']\n",
      "['1', 'anime', 'Japan', 'low', 'yes']\n",
      "['2', 'science', 'America', 'low', 'yes']\n",
      "['3', 'anime', 'America', 'low', 'yes']\n",
      "['4', 'action', 'America', 'high', 'yes']\n",
      "['5', 'action', 'China', 'high', 'yes']\n",
      "['6', 'anime', 'China', 'low', 'yes']\n",
      "['7', 'science', 'France', 'low', 'no']\n",
      "['8', 'action', 'China', 'low', 'no']\n",
      "['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no']\n",
      "[{'type': 'anime', 'country': 'Japan', 'gross': 'low'}, {'type': 'science', 'country': 'America', 'gross': 'low'}, {'type': 'anime', 'country': 'America', 'gross': 'low'}, {'type': 'action', 'country': 'America', 'gross': 'high'}, {'type': 'action', 'country': 'China', 'gross': 'high'}, {'type': 'anime', 'country': 'China', 'gross': 'low'}, {'type': 'science', 'country': 'France', 'gross': 'low'}, {'type': 'action', 'country': 'China', 'gross': 'low'}]\n"
     ]
    }
   ],
   "source": [
    "filename = 'data/film.csv'\n",
    "film_data = open(filename,'rt')\n",
    "# read the rows from the csv file\n",
    "reader = csv.reader(film_data)\n",
    "# next(reader) skips the header row and returns it as a list\n",
    "headers = next(reader)\n",
    "print(headers)\n",
    "\n",
    "feature_list = [] # the feature dicts\n",
    "result_list = []  # the target labels\n",
    "\n",
    "for row in reader:\n",
    "    print(row)\n",
    "    # target column (the last one)\n",
    "    result_list.append(row[-1])\n",
    "    # feature columns (drop the id and the target)\n",
    "    feature_list.append(dict(zip(headers[1:-1],row[1:-1])))\n",
    "    \n",
    "print(result_list)\n",
    "\n",
    "print(feature_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25f06051",
   "metadata": {},
   "source": [
    "**Execution order of a with…as statement:**\n",
    "\n",
    "–> First, the expression's __enter__ method runs; its return value is bound to the variable after as (it can return whatever you know how to handle). If as variable is omitted, the return value is discarded.\n",
    "\n",
    "\n",
    "–> Then the statements in the with-block run. Whether they succeed or fail (an exception, an error, a sys.exit() call), once the with-block finishes, the expression's __exit__ method runs.\n",
    "\n",
    "\n",
    "__exit__ defines what the object should do when the with-block completes or is aborted. If the block succeeded, exception_type, exception_val and trace are all None; if it raised, the interpreter fills in those three arguments, much as in a try/except/finally statement.\n",
    "\n",
    "> import csv\n",
    "> \n",
    "> filename = 'data.csv'\n",
    "> \n",
    "> with open(filename, 'r') as file:\n",
    ">\n",
    ">     reader = csv.reader(file)\n",
    ">\n",
    ">     next(reader)  # skip the header row\n",
    "> \n",
    ">     for row in reader:\n",
    ">         # process each row here\n",
    ">         print(row)\n"
   ]
  },
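  {
   "cell_type": "markdown",
   "id": "ctxmgr-demo-md",
   "metadata": {},
   "source": [
    "A minimal class (with illustrative names) that makes the __enter__/__exit__ ordering above concrete:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ctxmgr-demo-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Demo:\n",
    "    def __enter__(self):\n",
    "        print('enter')                 # runs first; `as` binds the return value\n",
    "        return 'resource'\n",
    "    def __exit__(self, exception_type, exception_val, trace):\n",
    "        print('exit', exception_type)  # always runs, even after an error\n",
    "        return False                   # False: propagate any exception\n",
    "\n",
    "with Demo() as r:\n",
    "    print('body', r)\n",
    "# prints: enter / body resource / exit None\n"
   ]
  },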
  {
   "cell_type": "markdown",
   "id": "a68ff65e",
   "metadata": {},
   "source": [
    "***\n",
    "<br>\n",
    "\n",
    "#### tip\n",
    "\n",
    "* sklearn expects every input to be a numeric numpy array\n",
    "* DictVectorizer handles non-numeric features by building new features named after the original feature/value pairs\n",
    "* those features are quantified as 0/1; numeric features are simpler and generally keep their original values\n",
    "* print the transformed feature matrix\n",
    "* the toarray method turns the transformed list of dicts into a numpy array\n",
    "\n",
    "<img src='img/vec.jpg' width='600px' />"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "218a8858",
   "metadata": {},
   "source": [
    "### Binarizing the dict features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "5814897a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0. 0. 0. 1. 0. 1. 0. 1. 0.]\n",
      " [1. 0. 0. 0. 0. 1. 0. 0. 1.]\n",
      " [1. 0. 0. 0. 0. 1. 0. 1. 0.]\n",
      " [1. 0. 0. 0. 1. 0. 1. 0. 0.]\n",
      " [0. 1. 0. 0. 1. 0. 1. 0. 0.]\n",
      " [0. 1. 0. 0. 0. 1. 0. 1. 0.]\n",
      " [0. 0. 1. 0. 0. 1. 0. 0. 1.]\n",
      " [0. 1. 0. 0. 0. 1. 1. 0. 0.]]\n",
      "['country=America' 'country=China' 'country=France' 'country=Japan'\n",
      " 'gross=high' 'gross=low' 'type=action' 'type=anime' 'type=science']\n"
     ]
    }
   ],
   "source": [
    "# initialize the dict feature extractor\n",
    "vec = DictVectorizer()\n",
    "\n",
    "dummyX = vec.fit_transform(feature_list).toarray()\n",
    "print(dummyX)\n",
    "\n",
    "# column layout of the printout:\n",
    "# country (4 cols) | gross (2 cols) | type (3 cols)\n",
    "# note: dummyX columns are sorted alphabetically: 'country','gross','type'\n",
    "\n",
    "# show what each column means\n",
    "print(vec.get_feature_names_out())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1da08f18",
   "metadata": {},
   "source": [
    "### Binarizing the label list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "3ede0747",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1]\n",
      " [1]\n",
      " [1]\n",
      " [1]\n",
      " [1]\n",
      " [1]\n",
      " [0]\n",
      " [0]]\n"
     ]
    }
   ],
   "source": [
    "# LabelBinarizer binarizes non-numeric labels\n",
    "# (for numeric values, use Binarizer instead)\n",
    "dummyY = preprocessing.LabelBinarizer().fit_transform(result_list)\n",
    "print(dummyY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "82806541",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "clf: DecisionTreeClassifier(criterion='entropy', random_state=1)\n"
     ]
    }
   ],
   "source": [
    "'''\n",
    "1. build the classifier, a decision-tree model\n",
    "2. train the decision tree on the data\n",
    "\n",
    "'''\n",
    "clf = tree.DecisionTreeClassifier(criterion='entropy',random_state=1)\n",
    "clf = clf.fit(dummyX,dummyY)\n",
    "\n",
    "'''\n",
    "fit() is the generic training entry point; fit(X,y) means supervised learning.\n",
    "For a model like this, fit runs the learning procedure over the data and labels.\n",
    "fit is where the learning happens (on large datasets it can take a long time).\n",
    "'''\n",
    "print(\"clf: \" + str(clf))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "80870680",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pydotplus\n",
    "import os\n",
    "\n",
    "# adjust this to wherever Graphviz is installed on your machine\n",
    "os.environ['PATH'] += os.pathsep + 'X:/graphviz/bin'\n",
    "\n",
    "dot_data = tree.export_graphviz(clf\n",
    "                                ,feature_names=vec.get_feature_names_out()\n",
    "                                ,filled=True\n",
    "                                ,rounded=True\n",
    "                                ,special_characters=True\n",
    "                                ,out_file=None)\n",
    "# strip the newlines so pydotplus parses the dot source cleanly\n",
    "dot_data = dot_data.replace('\\n','')\n",
    "graph = pydotplus.graph_from_dot_data(dot_data)\n",
    "graph.write_pdf('film2.pdf')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5924d543",
   "metadata": {},
   "source": [
    "<img src='img/去掉转义字符.jpg' width='600px' />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "3b3ecc50",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
       "A low-grossing Japanese anime. Watch it? [1]\n",
       "A low-grossing French anime. Watch it? [0]\n",
       "A high-grossing American action film. Watch it? [1]\n"
     ]
    }
   ],
   "source": [
    "# predict on hand-built test samples\n",
    "A = ([[0,0,0,1,0,1,0,1,0]])  # Japan, low gross, anime\n",
    "B = ([[0,0,1,0,0,1,0,1,0]])  # France, low gross, anime\n",
    "C = ([[1,0,0,0,1,0,1,0,0]])  # America, high gross, action\n",
    "\n",
    "predict_resultA = clf.predict(A)\n",
    "predict_resultB = clf.predict(B)\n",
    "predict_resultC = clf.predict(C)\n",
    "\n",
    "print(\"A low-grossing Japanese anime. Watch it? \" + str(predict_resultA))\n",
    "print(\"A low-grossing French anime. Watch it? \" + str(predict_resultB))\n",
    "print(\"A high-grossing American action film. Watch it? \" + str(predict_resultC))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71eee47f",
   "metadata": {},
   "source": [
    "***\n",
    "<br>\n",
    "\n",
    "\n",
    "# Summary\n",
    "1. sklearn's decision tree supports the entropy criterion (as in ID3) and the gini criterion (as in CART); the split criterion is chosen with the criterion hyperparameter. If the class distribution is very imbalanced, consider the class_weight parameter to keep the model from favoring the majority classes.\n",
    "\n",
    "2. If a feature is continuous, discretize it first (score 0-100: 1-59 fail, 60-80 pass, 80-100 excellent).\n",
    "\n",
    "3. Selecting features by maximizing information gain tends to pick features with many categories (they yield the purest subsets, which is not what we want).\n",
    "\n",
    "One remedy: when computing the entropy of the split subsets, add a regularization term proportional to the number of categories. When the algorithm picks a many-valued feature that drives the entropy down, the penalty pushes the final entropy back up, restoring the balance.\n",
    "\n",
    "4. With few samples but very many features, the tree overfits easily; a model with more samples than features is more robust.\n",
    "\n",
    "5. For the problem in point 4, reduce the dimensionality before fitting, e.g. with principal component analysis (PCA), then fit.\n",
    "\n",
    "6. Visualize the tree often, and start with a limited depth (max_depth=3); inspect the initial fit before deciding whether to grow deeper."
   ]
  },
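  {
   "cell_type": "markdown",
   "id": "discretize-demo-md",
   "metadata": {},
   "source": [
    "For point 2, continuous scores can be binned with pandas.cut; the bin edges follow the example above (treating exactly 80 as 'pass' is one reasonable boundary choice):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "discretize-demo-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "scores = pd.Series([35, 59, 60, 75, 80, 81, 100])\n",
    "# bins as in the example: 1-59 fail, 60-80 pass, 80-100 excellent\n",
    "grades = pd.cut(scores, bins=[0, 59, 80, 100],\n",
    "                labels=['fail', 'pass', 'excellent'])\n",
    "print(list(grades))\n"
   ]
  },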
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa8d471c",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
